The Effect of Orthographic Knowledge on Syllable Segmentation: a Cross-linguistic Study
Abstract
Recent psycholinguistic research has revealed a variety of significant intrasyllabic units, illustrating how segments can cohere into higher-order constituents within a syllable. Among these are the rime (for English and perhaps also Chinese), the body (for Korean), and the mora (for Japanese). However, the native speakers tested so far in all four of these language groups were well educated, literate, and often even bilingual. Thus they were all exposed to the writing systems of their first and/or second language, which (for different reasons in each case) might have predisposed them to perform the way they did. In the present research we are testing speakers of these languages who have not been subjected to the influence of L1 spelling. Results to date show no differences between literate and preliterate Korean children, while research continues with other nonliterate speakers and the other languages.

1. PRIOR RESEARCH

The research outlined here is part of a larger cross-linguistic investigation of phonological units in languages of diverse types. Previous research has focused on the status of the syllable (e.g., CVC) and a variety of its hypothesized intrasyllabic constituents, including the segment (C or V), the rime (VC), the body (CV), and the mora (a timing unit that can have several phonetic manifestations, including CV). A variety of experimental tasks has been employed in this effort, including (1) word blending, (2) global sound similarity judgments (SSJs), and (3) concept formation. Using a forced-choice version of the word blending task, for example, it was found that English speakers preferred onset plus rime word blends, with break points before the vowel (e.g., SIEVE + FUZZ > SUZZ), while Korean speakers preferred body plus coda blends, breaking after the vowel (e.g., KANG + SEM > KAM) [1, 2].
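The two blend types in the word blending task can be made concrete with a small sketch. It assumes each CVC syllable is written as an (onset, vowel, coda) tuple; the function name `blend` and the rough segment labels used for SIEVE, FUZZ, KANG, and SEM are illustrative assumptions, not careful transcriptions taken from the studies cited.

```python
def blend(first, second, break_after_vowel):
    """Blend two CVC syllables given as (onset, vowel, coda) tuples.

    break_after_vowel=False: onset of `first` + rime (VC) of `second`.
    break_after_vowel=True:  body (CV) of `first` + coda of `second`.
    """
    onset1, vowel1, _ = first
    _, vowel2, coda2 = second
    if break_after_vowel:
        return (onset1, vowel1, coda2)
    return (onset1, vowel2, coda2)


# Rough segment labels, assumed for illustration only.
SIEVE, FUZZ = ("s", "i", "v"), ("f", "u", "z")
KANG, SEM = ("k", "a", "ng"), ("s", "e", "m")

print(blend(SIEVE, FUZZ, break_after_vowel=False))  # ('s', 'u', 'z'), i.e. SUZZ
print(blend(KANG, SEM, break_after_vowel=True))     # ('k', 'a', 'm'), i.e. KAM
```

The only difference between the two blend types is which syllable contributes the vowel, which is why a preference for one break point over the other is informative about constituent structure.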
In addition, a linear regression analysis of SSJ ratings for CVC-CVC pairs revealed that shared individual segments (Cs or V) and a shared rime (VC) unit all made significant independent contributions to mean similarity scores from English speakers, while it was the body (CV) unit that complemented the segments in the results of a comparable task by Korean speakers. Moreover, for CVC pairs sharing two segments out of three, those which shared a final VC were judged to be significantly more similar by English speakers than pairs which shared an initial CV, whereas the opposite was true for Korean speakers [3, 4]. Furthermore, in a concept formation study (not done in English), Korean speakers found a target set of words all sharing the common body element /ka/ easier to master than a target set of words all sharing the common rime element /ak/ [5]. Taken together, these results suggest that English and Korean CVC syllables may be segmented differently, with onset (C) + rime (VC) constituents manifested in English, but body (CV) + coda (C) units in Korean. Other studies have likewise supported the possibility of a mora unit in Japanese [6, 7], while the results for Chinese have been mixed, depending on the nature of the task employed. This suggests that a more complicated interplay of factors may be at work in that language family [8].

2. THE INFLUENCE OF ORTHOGRAPHY

Apart from the Chinese case, however, the results of these prior studies are remarkably consistent, especially in view of the diversity of experimental tasks that have been utilized, each of which differs in the kind of responses it calls for and/or the level of metalinguistic awareness that it presumably invokes.
In word blending, for example, subjects produce novel blends (or, in a forced-choice version of the task, choose between one blend and another), a task which would seem to overtly direct their attention to intrasyllabic break points, and not necessarily to the whole units that are defined by these breaks. In the concept formation task, however, subjects are directed to overtly discover the phonological units that all members of the target set share, guided by feedback as to which stimuli belong to the target set and which do not. In the SSJ task, by contrast, subjects are not required to attend to either break points or constituent elements at all, but merely to make global intuitive assessments of the overall similarity in sound of each of the syllable pairs presented. The fact that all three tasks can lead to the same conclusions (as in the Korean case) shows, at the very least, that the findings are not the result of a strategy that is linked to any specific task.

For all of this, at least one major problem remains which makes the interpretation of the results of these studies difficult. Specifically, all of the subjects tested in the research described above have been literate adults, mostly highly educated university students, and hence well versed in the orthographic norms of the languages in which they were tested. The factor of orthographic knowledge has therefore not been controlled in any of the studies reported. What kind of biases might knowledge of the writing systems of the languages tested be reasonably expected to introduce?
As numerous investigators of child language have pointed out, since standard English spelling contains units (letters and digraphs) that largely represent individual phonemic segments, the writing system provides a major (and perhaps decisive) impetus for the identification of these segments. Thus, some argue, it is more than mere coincidence that phonemic awareness arises in English-speaking children at just about the same time that they are learning to read [9, 10, 11]. (See also [12], which demonstrates deficiencies in segment manipulation skills by Chinese speakers who, though literate in Chinese characters, have not been exposed to either English or any other segment-based transliteration system, such as pinyin.) On the other hand, though exposure to a segment-based alphabetic writing system may well be implicated in awareness of the segment, we have no hard evidence that it is implicated in awareness of any larger phonological units, such as the syllable or the rime. Thus, while it is true that English letters can sometimes represent strings that are larger than a single segment (witness the letter X, which often represents the sequence /ks/, as in the word MIX, or even a whole syllable, as in X-RAY), such examples are atypical and uncharacteristic of the writing system as a whole. Despite its widespread use of digraphs and its many familiar irregularities and inconsistencies (and even logographic tendencies), this system still utilizes an alphabet that remains strongly attuned to the "phonographic" tradition of its Graeco-Roman precursors, which is to match individual letters with individual C and V sounds [13].

page 93 ICPhS99 San Francisco

In sharp contrast, however, the standard Korean orthography not only contains symbols for individual segments but also consistently bundles these elements into syllable-like packages, by stacking the letters in vertical arrays.
Moreover, in CVC syllables where the vowel letter is itself written with a vertical orientation, the first two letters (representing CV) are written on the same (top) line, with the letter for the coda consonant written below them. This orthographic convention thus strongly suggests that vowels are more closely associated with preceding consonants than with following ones, introducing a potential bias in favor of a body or CV constituent.

In the case of Japanese, where the mora is the unit of primary experimental interest, the case for potential orthographic influence is even stronger, as both of the so-called "syllabaries" of Japanese actually consist of symbols that represent individual mora units, rather than either syllables or segments. Thus, for a word written in either of these kana, the mora count consistently matches the number of letters used to spell it [14].

Finally, in the case of Chinese, the standard orthography, of course, utilizes ideographic (or logographic) characters, each of which is coextensive with both a single morpheme and a single syllable. This might naturally be expected to introduce a bias in favor of the syllable, but not towards any particular phonological units smaller than this. The complication introduced in the Chinese research, however, has not arisen through the standard orthography, but through a secondary writing system that is widely employed in Taiwan, where most of the prior research on segmentation was carried out. This system, called chuyin-fuhao (or bopomofo, informally), is used for schoolwork in Mandarin during the early school years, while the children are still struggling with the Chinese characters, and it contains symbols which represent individual "initial" (onset) and "final" (rime) units.
The potential bias so introduced in favor of an onset-rime analysis for Mandarin is obvious, and we cannot discount the possibility that this same bias might also be extended to spoken Taiwanese, given the typological similarities of Mandarin and Taiwanese and the fact that all Taiwanese speakers are required to learn Mandarin in school.

In sum, therefore, we can see that (with the possible exception of the rime unit in English) there is potential orthographic contamination with respect to all of the units and languages focused upon in this line of research. To ensure that this bias was not responsible for the results obtained in the earlier experiments with literate adult subjects, we are expanding our tests to include subjects who do not know how to read or write the languages involved and who would thus not be subject to the particular biases that each orthographic convention might introduce.

3. TESTING PRELITERATE CHILDREN

Our long-term plans include the testing of three different types of nonliterate speakers, including both illiterate adults and split-literate bilinguals (such as emigrants to North America who have learned to speak, say, Korean or Japanese near-natively, but have learned to write only English). By far the most accessible of our potential nonliterate subjects, however, are young, preliterate children, so this is the group that we have decided to investigate first. Early pilot testing quickly revealed, however, that some (and perhaps all) of the tests that we used with literate adults (such as concept formation) were not well suited for testing children. Thus the first priority in the present endeavor has been the development of new experimental vehicles through which both literate and nonliterate children (as well as adults, with appropriate modifications) might be tested and compared. After much trial and error in pilot work, we seem finally to have arrived at a task that meets that need.
We call this protocol the List Recall task. In this task, children are presented with a mixed series of two types of lists, each based on a particular unit of interest. To illustrate for studies comparing the rime vs. the body, two sets of monosyllabic CVC nonsense words are employed, each word representing the name of a pictured made-up animal. In one list type, all of the names in a given picture set rhyme, i.e., they all end with the same VC sequence (e.g., /ip/, as in TEEP, HEEP, MEEP), while the names for each picture set of the other type all share a common body or CV sequence (e.g., /ki-/, as in KEET, KEEM, KEETCH). Only nonsense words are used, in order to avoid familiarity and frequency effects. Though all these made-up names are phonotactically legal, the range of consonants and vowels used in constructing the items was tightly controlled, in order to permit the English and Korean stimuli to be as similar as possible. Pilot work with English-speaking children indicated that lists that shared common rime elements were learned and remembered more readily than the other lists. Pilot studies also indicated that lists of three or four words each were easy enough to be learned, in whole or in part, with only two or three exposures to each of the names.

Finally, a simple Reading Test has also been introduced, in order to separate subjects into groups of readers vs. nonreaders. In this test, subjects are asked to identify a series of 20 pictures of familiar objects whose names are high-frequency monosyllabic (CVC) words (e.g., /kek/ 'cake') and then to select the correct spelling of the word from a choice of four alternatives (e.g., CAKE, RAKE, CAVE, HOT). Notice that the first of the incorrect spellings shares a rime element with the correct spelling (the three letters AKE in this case), the second shares a body element with it (the two letters CA), while the last choice has no letters in common with the correct spelling.
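The three distractor types in the Reading Test can be illustrated with a short sketch. It assumes, purely for illustration, that the correct spelling has a one-letter onset, so that its "body" letters are the first two letters and its "rime" letters are everything after the first letter; the function name `classify_distractor` and the labels it returns are hypothetical and are not part of the test materials.

```python
def classify_distractor(correct, distractor):
    """Label a spelling by the letters it shares with the correct spelling.

    Assumption (illustrative only): a one-letter onset, so the body is the
    first two letters and the rime is everything after the first letter.
    """
    correct, distractor = correct.upper(), distractor.upper()
    if distractor == correct:
        return "correct"
    if distractor[1:] == correct[1:]:
        return "rime"        # same letters after the onset, e.g. AKE
    if distractor[:2] == correct[:2]:
        return "body"        # same first two letters, e.g. CA
    if not set(distractor) & set(correct):
        return "unrelated"   # no letters in common
    return "other"


for choice in ("CAKE", "RAKE", "CAVE", "HOT"):
    print(choice, classify_distractor("CAKE", choice))
```

Applied to the example item, the four alternatives fall into the four categories described above, one each.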
The four choices were presented in a different order for each word on the test, with the correct spelling appearing five times in each of the four possible positions; the list of words was presented in a single invariant order to all the subjects. As indicated below, readers were distinguished from nonreaders by comparing their number-correct scores (on the full 20 items) with the expectation due to chance (25%).

4. RESULTS

At the present time, the only results that we have available involve the testing of literate and preliterate children in English and Korean.

4.1. Reading Test

The English results for the Reading Test were obtained from 45 preschool and first grade children who completed the test, and these scores were used to distinguish readers from nonreaders. A binomial test reveals that, for a forced-choice test of 20 items, with 4 choices on each item, up to 9 of the 20 correct spellings might be identified by chance at the .05 level; since we had no subjects who chose precisely 9 correct spellings, we adopted the more conservative score of 8 correct spellings or fewer to identify those children who were evidently merely guessing on the Reading Test. A total of 24 Canadian children scored from zero to 8 correct on the Reading Test and these children were classified as the nonreaders. Almost all of the remaining English-speaking children tested scored between 14 and 20 on the Reading Test, so this range was selected to define the class of readers. (Note that 14 correct out of 20 represents the selection of the correct spellings in over two-thirds of the items presented, and is a score that would be expected to happen by chance with a probability of less than one in a hundred thousand [p < .00001].) A total of 21 children fell into this category.

The Korean version of the Reading Test was constructed on the same principles as the English version.
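The chance thresholds used to separate readers from nonreaders can be explored with a short calculation. The text does not say exactly how the binomial test was set up, so the sketch below assumes a one-sided upper-tail probability under pure guessing (20 items, probability .25 of a correct choice on each); the function name `upper_tail` is illustrative. The printed values depend on that assumption and need not match the cutoffs quoted above if a different form of the test was used.

```python
from math import comb


def upper_tail(k, n=20, p=0.25):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Scores of interest: the nonreader cutoff region (8, 9) and the reader minimum (14).
for k in (8, 9, 14):
    print(f"P(X >= {k}) = {upper_tail(k):.6f}")
```

Computing the tail by direct summation keeps the sketch dependency-free; a statistics package would give the same probabilities.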
The Korean test also involved 20 pictured high-frequency monosyllabic words and the systematic presentation of one correct and three incorrect alternative spellings for each. The criteria for defining readers and nonreaders were also the same. The results summarized below came from the 15 Korean nonreaders and 19 readers tested so far.

4.2. List Recall Test

The stimuli for the English version of the List Recall Test comprised 12 lists of monosyllabic nonsense words, with 6 lists in the rime category and 6 in the body category, plus a practice list of each type. In each of these categories, four of the lists contained 3 words each (corresponding to sheets with three animal pictures on them) and two of the lists contained 4 words each (with four animal pictures per sheet). The most important thing about these stimuli was that the 3 (or 4) members of each list in the rime category all rhymed (that is, they all ended in the same VC sequence), while the members of each list in the body category all contained the same body unit (i.e., they all began with the same CV). Note, finally, that members of all 12 lists shared two phonemes out of three, namely, the vowel and one of the two consonants. A parallel set of stimuli was also constructed for the Korean version of the List Recall Test.

For each English or Korean child, the 12 sheets of animal pictures were mixed together and shuffled anew, so that every child was presented with a new random and intermixed ordering of rime and body lists. After two practice trials (one of each type), the sheets were then shown to a child one at a time and the names of each of the 3 (or 4) animals given; the child was then asked to repeat each name aloud, as the experimenter pointed to it. For each set of pictures, this sequence of hearing and then repeating the animals' names (always in the same order) was repeated twice (for readers) or three times (for nonreaders).
(This variation in the number of repetitions provided for each animal name was introduced on the basis of pilot work, in order to minimize floor and ceiling effects. Note that this variation introduces no problem in analysis, since our interest did not involve comparing the absolute total scores of readers vs. nonreaders. All we were interested in was the relative number of names recalled from the rime lists vs. those recalled from the body lists, so we were careful to ensure that each individual child received the same number of name repetitions for each of the 12 lists presented.) On the final pass through the pictures on a given sheet, the child was then asked to repeat the names for each picture in turn, without prompting, as the experimenter pointed to it.

Each child's score was calculated as the number of names correctly recalled out of the total of 20 across the six lists presented in each category (rime and body). Scores were tabulated separately for nonreaders and readers on the List Recall Test. Overall, the 24 English-speaking nonreaders averaged 11.8 correct from the rime lists and only 9.0 correct from the body lists, a difference which a t-test shows to be significant at the level of p < .001. Similarly, the 21 readers averaged 16.6 correct on rime names and 15.0 correct on body names, a difference that is significant at p < .003. For both literate and nonliterate English-speaking children, therefore, the presence of shared final VC (or rime) elements made an arbitrary set of nonsense CVC syllables easier to remember than did the presence of shared initial CV (or body) elements for the complementary set. For the English-speaking children, the rime unit was more salient than the body unit.
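The text reports a t-test on rime vs. body recall without specifying its form. Since each child contributes one score for each list type, one natural reading is a paired comparison, and the sketch below shows how such a statistic would be computed from per-child scores. This is an assumption about the analysis, not a description of it; the function name `paired_t` is hypothetical, and no scores from the study are reproduced here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(rime_scores, body_scores):
    """Return (t, df) for the per-child differences rime - body.

    Assumes a paired design: position i in both sequences is the same child.
    """
    if len(rime_scores) != len(body_scores):
        raise ValueError("each child needs one rime and one body score")
    diffs = [r - b for r, b in zip(rime_scores, body_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two children")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1
```

A positive t indicates better recall on the rime lists and a negative t better recall on the body lists; the p-value would then be read from a t distribution with the returned degrees of freedom.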
Although only 34 children have been tested so far on the Korean version of the List Recall Test, the results already seem quite clear, with Korean readers and nonreaders both performing very differently from their English-speaking counterparts. Specifically, the 15 Korean nonreaders averaged only 7.4 correct recalls from the rime lists but 10.5 from the body lists, a difference which is significant at p < .001. The 19 Korean readers did much the same, averaging 11.4 on the rime names and 13.6 on the body names (p < .001). The key finding here, of course, is the one from the Korean nonreaders, for whom the body stands out as the most salient unit. This distinguishes them from the English nonreaders, who did better with rhyming names than with body-sharing names. And since the Reading Test indicates that neither group knew how to read, it seems that this result reflects a genuine difference between the two languages, and not one that can be viewed as a mere consequence of differences in their writing systems.

5. SUMMARY AND CONCLUSIONS

In conclusion, we have for the first time found clear indications that preliterate English and Korean children mirror the performance of their older, literate counterparts. Specifically, whether influenced by orthographic conventions or not, English speakers are seen to segment syllables into onset (C) and rime (VC) constituents, while Korean speakers break them into body (CV) and coda (C) units. To strengthen this finding further (particularly in the Korean case, which flies in the face of the notion of the universality of the rime), we plan not only to test larger numbers of preliterate children, but also to extend the testing to nonliterate adults and to split-literate bilinguals, as described earlier in this paper. As indicated in the Introduction, we also plan to extend the testing to Japanese and Chinese, in order to explore the rather different kinds of potential orthographic effects that may be at work in those languages.
Any new results obtained by the time of the conference along either of these lines will be included in the oral presentation.
Publication date: 1999